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Abstract. We present a fast algorithm for linear least squares problems governed by hierarchi- 
cally block separable matrices. Such matrices are generally dense but data-sparse and can describe 
many important operators including those derived from the Green's functions of non-oscillatory el- 
liptic partial differential equations. The algorithm is based on a recursive skeletonization procedure 
£N| that exposes this sparsity and solves the dense least squares problem as a much larger, equality- 

y—( constrained, sparse one. It relies on a sparse QR factorization coupled with iterative weighted least 

squares methods. In essence, our scheme consists of a direct component, comprised of matrix com- 
^s^J pression and factorization, followed by an iterative component to enforce certain equality constraints. 

In practice, the iteration typically converges in no more than two steps and can therefore also be 
viewed as direct. We illustrate the performance of the method for both overdetermined and under- 

Q determined systems, with an emphasis on radial basis function approximation and efficient updating 
and downdating. 
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1. Introduction. The method of least squares is a powerful technique for the 
approximate solution of overdetermined systems and is often used for data fitting 
and statistical inference in applied science and engineering. In this paper, we will 
pj primarily consider the linear least squares problem 

(1.1) min||Ac — 6||, 

X 

where A e C MxN is dense and full-rank with M > N, x € C N , b € C M , and || • || is 
the Euclidean norm. Formally, the solution is given by 

i£ (1.2) x = A+b, 

£\\ where A + is the Moore-Penrose pseudoinverse of A, and can be computed directly via 

the QR decomposition at a cost of 0{MN 2 ) operations [40]. This can be prohibitive 
when M and N are large. If A is structured so as to support fast multiplication, 
then iterative methods such as LSQR [33] or GMRES [37] US] present an attractive 
alternative. However, such solvers still have several key disadvantages when compared 
with their direct counterparts: 

(i) The convergence rate of an iterative solver can depend strongly on the con- 
ditioning of the system matrix, which, for least squares problems, can sometimes be 
very poor. In such cases, the number of iterations required, and hence the compu- 
tational cost, can be far greater than expected (if the solver succeeds at all). Direct 
methods, by contrast, are robust in that their performance does not degrade with 
conditioning. Thus, they are often preferred in situations where reliability is critical. 
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Table 1.1 

Examples of radial kernels <f>{r) admitting HBS representations: zeroth order Hankel 

function of the first kind; Kq, zeroth order modified Bessel function of the second kind. 



Type 


Name 


Kernel 


Notes 






2D 


3D 






Laplace 


log r 


l/r 




Green's function 


Helmholtz 




e lkr /r 


k not too large 




Yukawa 


Ko(kr) 


e -kr j r 






Polyharmonic 




r 2n-l 


n = 1,2,3,... 


Radial basis function 


Multiquadric 
Inverse multiquadric 


vV 2 4 
1/vV 2 


-C 2 
+ C 2 


c not too large 



(ii) Standard iterative schemes are inefficient for multiple right-hand sides. With 
direct solvers, on the other hand, following an expensive initial factorization, the 
subsequent cost for each solve is typically much cheaper (e.g., only 0(MN) work to 
apply the pseudoinverse given precomputed QR factors) . This is especially important 
in the context of updating and downdating as the least squares problem is modified 
by adding or deleting data, which can be viewed as low-rank updates of the original 
system matrix. 

In this paper, we present a fast direct least squares solver for a class of structured 
dense matrices called hierarchically block separable (HBS) matrices. Such matrices 
were introduced by Gillman, Young, and Martinsson 24 , and possess a nested low- 
rank property that enables highly efficient data-sparse representations. The HBS 
matrix structure is closely related to that of JriF- and J^f 2 -matrices [33J 132 [35] and 
hierarchically semiseparable (HSS) matrices [T5J [TBI 151] , and can be considered a 
generalization of the matrix features utilized by multilevel summation algorithms like 
the fast multipole method (FMM) [2"51 129] . Many linear operators are of HBS form, 
notably integral transforms with asymptotically smooth radial kernels. This includes 
those based on the Green's functions of non-oscillatory elliptic partial differential 
equations [5]. Some examples are shown in Table [TTT] we highlight, in particular, the 
Green's functions 

<M {r) = -, (r) = r, 

r 

for the Laplace and biharmonic equations, respectively, in 3D, and their regulariza- 
tions, the inverse multiquadric and multiquadric kernels 

0imq (r) = . , 0mq (r) = Vr 2 + c 2 , 

Vr + c z 

respectively (for c not too large). The latter are well-known within the radial basis 
function (RBF) community and have been used to successfully model smooth surfaces 
[T31 . Also of note is the 2D biharmonic Green's function, the so-called thin plate 
spline 

(1.3) (j) T ps(r) = r 2 \ogr, 

which minimizes a physical bending energy |22j . For an overview of RBFs, see [101 144] . 

Previous work on HBS matrices exploited their structure to build fast direct 
solvers for the square M = N case 24, 38, 42 (similar methods arc available for other 
structured formats). Here, we extend the approach of [35] to the rectangular M > 
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Table 1.2 

Asymptotic complexities for the least squares solver when applied to the operators in Table [i~7] 
on data embedded in a d-dimensional domain: M and N, matrix dimensions; M > N . 



d Precomputation Solution 

~L 0(M + N) 0(M + N) 

2 0(M + N 3 / 2 ) 0(M + AT log AT) 

3 0{M + N 2 ) 0(M + A" 4 / 3 ) 



iV case. Our algorithm relies on the multilevel compression and sparsity-revealing 



embedding of |38j . and recasts the (unconstrained) dense least squares problem (1.1 1 
as a much larger, equality-constrained, sparse one. This is solved via a sparse QR 
factorization coupled with iterative weighted least squares methods. For the former, 
we use the SuiteSparseQR package J20| by Tim Davis, while for the latter, we employ 
the iteration of Barlow and Vemulapati [5 , which has been shown to converge in no 
more than two steps for problems that are not too ill-conditioned. Thus, our solver 
can, in practice, be viewed as a direct method. 

It is useful to divide our algorithm into two phases: a precomputation phase, 
comprising matrix compression and factorization, followed by an iterative solution 
phase using the precomputed QR factors. Clearly, for a given matrix, only the solution 
phase must be executed for each additional right-hand side. Table [L2] lists asymptotic 
complexities for both phases when applied to the operators in Tabic |1.1| on data 
embedded in a d-dimcnsional domain for d = 1, 2, or 3. Although the estimates 
generally worsen as d increases, the solver achieves optimal 0(M + N) complexity 
for both phases in any dimension in the special case that the source (column) and 
target (row) data are separated (i.e., the domain and range of the continuous operator 
are disjoint). This may have applications, for example, in partial charge fitting in 
computational chemistry [5J 123j . 

Our methods can also generalize to underdetermined systems (M < N) when 
seeking the minimal norm solution in L 2 , i.e., the equality-constrained least squares 
problem 

(1.4) min ||a;||, 

Ax=b 



provided that the solution, which is also given by ( 1.2 ), is not too ill-conditioned with 
respect to A. 

Fast direct least squares algorithms have been developed in other structured ma- 
trix contexts as well, in particular within the M'- and HSS matrix frameworks using 
various structured orthogonal transformation schemes Q31 [2TJ [35] . Our approach, 
however, is quite different and explicitly leverages the sparse representation of HBS 
matrices and the associated sparse matrix technology (e.g., the state-of-the-art soft- 
ware package SuiteSparseQR) . This has the possible advantage of producing an algo- 
rithm that is easier to implement, extend, and optimize. For related work on other 
structured matrices including those of Toeplitz, Hankel, and Cauchy type, see, for 
instance, [3TJ [3§1 071 [52] and references therein. 

The remainder of this paper is organized as follows. In the next two sections, we 
collect and review certain mathematical preliminaries on HBS matrices (section[2]) and 
equality-constrained least squares problems (section |3j). In section |4j we describe our 
fast direct algorithm for both overdetermined and underdetermined systems. Com- 
plexity estimates are given in section [5j while section [6j discusses efficient updating 
and downdating in the context of our direct solver. Numerical results are reported 
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column 




Fig. 2.1. A block separable matrix has low-rank off-diagonal block rows and columns (black); 
its diagonal blocks (white) can in general be full-rank. 

in section [7] Finally, in section |8j we summarize our findings and end with some 
generalizations and concluding remarks. 

2. HBS matrices. In this section, we define the HBS matrix property and 
discuss algorithms to compress such matrices and to sparsify linear systems governed 
by them. We will mainly follow the treatment of [35] , extended to rectangular matrices 
in the natural way. 

Let A G <C MxN be a matrix viewed with pxp blocks, with the ith row and column 
blocks having dimensions rrii, rii > 0, respectively, for i = 1, . . . ,p. 

Definition 2.1 (block separable matrix [24] ). The matrix A is block separable 
if each off-diagonal submatrix Ay can be decomposed as the product of three low-rank 
matrices: 

(2.1) .1, hr i , j, 

where U G C m '**S G C fe - X ^, and Rj € C k i xn i , with (ideally) k\ < to, and 

fc j < »j> / or i,J = 1, • • • ,P- 

Definition 2.2 (off-diagonal block rows and columns). The ith off-diagonal 
block row of A is the submatrix [An ■ ■ ■ A,-(j_i) Aj/j+i) • • • Ai p ] consisting of the ith 
block row of A with the diagonal block An removed; the off-diagonal block columns 
are defined analogously. 



Clearly, the block separability condition (2.1 1 is equivalent to requiring that the 
off-diagonal block rows and columns have low rank (Fig. 2.1 1. Observe that if A is 
block separable, then it can be written as 

(2.2) A = D + LSR, 

where D — diag(Aji), L = diag(Lj), R = diag(i?^), and S = is dense with 

S lt = 0. 

Let us now define a tree structure on the row and column indices I = {1, . . . , M} 
and J = {1, . . . , N}, respectively, as follows. Associate with the root of the tree the 
entire index sets / and J. If a given subdivision criterion is satisfied (e.g., based on 
the sizes |7| and \ J\), partition the root node into a set of children, each associated 
with a subset of I and J such that they together span the whole sets. Repeat this 
process for each new node to be subdivided, partitioning its row and column indices 
among its children. In other words, if we label each tree node with an integer i and 
denote its row and column index sets by Jj and Ji, respectively, then 

h = Ij, Ji = Jj, 

j'Gchild(i) jGchild(i) 

where child(i) gives the set of node indices belonging to the children of node i. Fur- 
thermore, we also label the levels of the tree, starting with level for the root at the 



coarsest level to level A for the finest-level leaves; see Fig. 2.2 for an example 
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Fig. 2.2. An example of a tree on the row and column index sets with depth A = 3. The root 
(node 1) contains all indices, which are hierarchically partitioned among its children. 



Remark. Although we require that the number of row and column partitions be 
the same, we do not impose that |7j| = \Ji\ for any node i. Indeed, it is possible for 
one of these sets to be empty. 

Evidently, the tree defines a hierarchy among row and column index sets, each 
level of which specifies a block partition of the matrix A. 

Definition 2.3 (HBS matrix [21]). The matrix A is HBS if it is block separable 
at each level of the tree hierarchy. 

HBS matrices arise in many applications, for example, when discretizing the 
kernels in Table |1.1| (up to a specified numerical precision) , with row and column 
indices partitioned according to an octree-type ordering on the corresponding data 
[211 [29, 34, 38;, which recursively groups together points that are geometrically collo- 
cated [46] . 

2.1. Multilevel matrix compression. We now review algorithms for comput- 



ing the low-rank matrices in (2.1 1 characterizing the HBS form. Our primary tool for 
this task is the interpolative decompositon (ID) [17] . 

Definition 2.4 (ID). AnID of a matrix A e C mx ™ with rank k is a factorization 
A = BP. where B 6 c mxfc consists of a subset of the columns of A and P E C fexrl 
contains the k x k identity matrix. We call B and P the skeleton and interpolation 
matrices, respectively. 

As stated, the ID clearly compresses the column space of A, but we can just 
as well compress the row space by applying the ID to A T . Efficient algorithms for 
adaptively computing an ID to any specified precision are available [El [41] [50] . 

Definition 2.5 (row and column skeletons). The row indices corresponding to 
the retained rows in the ID are called the row or incoming skeletons; the column indices 
corresponding to the retained columns are called the column or outgoing skeletons. 

A multilevel algorithm for the compression of HBS matrices follows naturally. For 
simplicity, we describe the procedure for matrices with a uniform tree depth (i.e., all 
leaves are at level A), with the understanding that it extends easily to the adaptive 
case. The following scheme is known as recursive skeletonization (Fig. |2.3[ ): 

1. Starting at the leaves of the tree, extract the diagonal blocks and compress 
the off-diagonal block rows and columns using the ID to a specified precision e > 
as follows. For each block i — 1, . . . ,p, compress the row space of the zth off-diagonal 
block row and call Li the corresponding row interpolation matrix. Similarly, for each 
block j, compress the column space of the jth off-diagonal block column and call Rj 
the corresponding column interpolation matrix. Let S be the skeleton submatrix of 
A, with each off-diagional block £y for i ^ j defined by the row and column skeletons 
associated with Lj and Rj, respectively. 

2. Since the off-diagonal blocks SV,- are submatrices of the corresponding Ay, 
the compressed matrix S is HBS and so can itself be compressed in the same way. 
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Fig. 2.3. Schematic of recursive skeletonization, comprising alternating steps of compression 
and regrouping via ascension of the index tree. The diagonal blocks (white and gray) are extracted 
at each level; they are shown here only to illustrate the regrouping process. 



proxy 




| neighbor 
^ i ■ well-separated 
Proxy 



Fig. 2.4. The interaction of a region B with a distribution of exterior sources or targets (left) 
can be decomposed into neighboring and well-separated contributions. By representing the latter via 
a proxy surface F (center), the matrix dimension to compress against for the block row or column 
corresponding to B (right) can be reduced to just the number of points in the neighboring region plus 
a constant set of points on F. 



Thus, move up one level in the tree, regroup the matrix blocks accordingly, and repeat. 
The result is a telescoping matrix representation of the form 



(2.3) A a + [• ■ ■ + L< a > (z?« + L^D^R^A ■ ■ ■] R^ 



cf. (2.2), where the superscript indexes the tree level / = 0, . . . , A, that is accurate to 



precision approximately e. The algorithm is automatically adaptive in the sense that 
the compression is more efficient if lower precision is required [Ml I2H] • 

Remark. For the kernels in Table which obey some form of Green's theorem 
(at least approximately), it is possible to substantially accelerate the preceding algo- 
rithm by using a proxy surface to capture all far-field interactions (Fig. 2.4). The key 
idea is that any such interaction can be represented in terms of some equivalent den- 
sity on an appropriate local bounding surface, which can be chosen so that it requires 
only a constant number of points to discretize, irrespective of the actual number of 
points in the far field or their detailed structure. This observation hence replaces each 
global compression step with an entirely local one; see [TTJ [57J EH 02 for details. 



FAST LEAST SQUARES FOR HIERARCHICAL MATRICES 



7 



2.2. Structured sparse embedding. For M = N, the decomposition (2.3 1 
enables a highly structured sparse representation of the linear system Ax = b as 
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(2.5a) 




x< A > = X, 


X U) = R (i+V x (i+i) 


for I = A - 


-1,- 


.,0, 


(2.5b) 






n) = D (D x (i) + L m y (i) 


for 1 = 1,. 


..,A 


- 1. 



This expanded embedding clearly exposes the sparsity of HBS matrix equations and 
permits the immediate application of existing fast sparse solvers (such as UMFPACK 
p] as in [38]). 

If M > N, however, then we have to deal with the overdetermined problem 
(1.1), and ( 2.4 ) must be interpreted somewhat more carefully. In particular, the 



identities (2.5) still hold, so only the first block row of (2.4) is to be solved in 
the least squares sense. Thus, denoting the first block row of the sparse matrix 
in (2.4) by E and the remainder (i.e., its last 2A bl ock rows) by C, and defining 

(i) 



x = (x^ x \ y( x \ . . . , j/ 1 -*, a;( ') T , the analogue of (2.4) for (1.1) is the equality 
constrained least squares problem 



(2.6) 



min ||Ex — 611 

Cx=0 



where both E and C are sparse. It is easy to see that C has full row rank. 

Similarly, if M < N and we seek to solve the underdetermined system (1.4), then 
the corresponding problem is 



(2.7) 
where 



min IiX , 

Mx=bi 



M 



E 

C 



Ii = [ I o 



i.e., M is the entire sparse matrix on the left-hand side of (2.4), which also has full 



row rank; bi is the right-hand side of (2.4); and Ii is an operator that picks out the 



first block row of the vector on which it acts. 

3. Equality-constrained least squares. We now turn to the solution of linear 
least squares problems with linear equality constraints, with special attention to the 
case that both governing matrices are sparse as in (2.6) and (2.7). For consistency 
with the linear algebra literature, we adopt the notation of Barlow et al. [21 HI |S], 
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which unfortunately conflicts somewhat with our previous definitions; the following 
notation is thus meant to pertain only to this section. 
Hence, consider the problem 



(3.1) 



min \\Ex-f\\ 

Cx=g 



where E £ C mxn and C £ C pxn , with 

rank(C) = p, rank 



E 
C 



so that the solution is unique. Classical reduction schemes for solving (3.1 1, such 



as the direct elimination and nullspace methods, require matrix products that can 
destroy the sparsity of the resulting reduced, unconstrained system [5J 20] . 

3.1. Weighted least squares. An attractive alternative when both E and C 



are sparse is the method of weighting, which recasts (3.1 ) in the unconstrained form 



(3.2) 
where 



min||A(r)a; - 6(t)||, 



A(t) = 



E 
tC 



b(r) 



f 

rg 



for r a suitably large weight. Clearly, asr-> oo, the solution of (3.2) approaches that 



of (3.1 1. The advantage, of course, is that (3.2) can be solved using standard sparse 
techniques; this point of view is elaborated in [4l |4"8"]. 

However, the choice of an appropriate weight can be a delicate matter: if r is 



too small, then (3.2) approximates (3.1) poorly, while if t is too large, then (3.2) can 



become ill-conditioned. An intuitive approach is to start with a small weight, then 
carry out some type of iterative refinement, effectively increasing the weight with each 
step. Such a scheme was first proposed by Van Loan [48], then further studied and 
improved by Barlow et al. [3j [5] ; we summarize their results in the next section. 

3.2. Iterative reweighting by deferred correction. In [5], Barlow and Vem- 
ulapati presented the following deferred correction procedure for the solution of the 



equality-constrained least squares problem (3.1) via the successive solution of the 



weighted problem (3.2 ) with a fixed weight r: 
1. Find 



x (0) = arg min ||A(r)x - 6(r)|| 



and set 



2. For k = 0, 1, 2, . . . until convergence, find 
Ax^ = arg min 



TW (k) +T -i x (k) 
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and update 



x (k+i) = x (k) 



W (k+D =w (k) _CAxW, 
=A( fc )+rV fe+1 ). 



Terminate when the constraint residual || iy( fc+1 ) |j is small. 

Since t is fixed, a single precomputed QR factorization of A{t) can be used for all 
iterations. This algorithm is a slight modification of that employed by Van Loan [48] 
and has been shown to converge to the correct solution for r appropriately chosen, 
provided that (3.1) is not too ill-conditioned [31 [51 05] (see also section 4.3). In 
particular, if implemented in double precision, then for r = £ n ^J c ^ ~ 1-7 x 10 5 , where 
£mach is the machine epsilon, Barlow and Vemulapati [5] showed that their algorithm 
requires no more than two iterations to produce an accurate result. Thus, for a broad 



class of problems of the form (3.1) for which it is reasonable to expect an accurate 
answer, the above scheme can be considered a direct method, which can be made 
explicit by running the iteration for exactly two steps. 



Remark. Although Barlow and Vemulapti [5] considered (3.1) only over the reals, 
there is no inherent difficulty in extending their solution procedure to the complex 
case. 

4. Algorithm. We are now in a position to describe our fast direct method for 
solving the over- and underdetermined systems ( |1.1| and (1.4), respectively, when A 



is HBS. Let e > be a specified compression precision and set r = e 



-1/3 
mach 



1.7 x 10 



5. 



we assume that all calculations are performed in double precision. 

4.1. Overdetermined systems. Let A € C MxN be HBS with AI > N. 
algorithm proceeds in two phases. First, for the precomputation phase: 

1. Compress A to precision e using recursive skeletonization |38) . 

2. Compute a sparse QR factorization of the weighted sparse matrix 

E 

rC 



Our 



where E and C are as defined in section [2721 

This is followed by the solution phase, which, for a given right-hand side b e C M , 



produces an approximate solution (1.2) by using the precomputed QR factors to 



solve the equality-constrained least squares embedding (2.6) via deferred correction 
[5]. Clearly, for a fixed matrix A, only the solution phase must be performed for 
each additional right-hand side. Therefore, the cost of the precomputation phase is 
amortized over all such solves. 

Remark. The algorithm can also easily accommodate various modifications of the 
standard system (1.1), c.; 



(4.1) 



the problem 
min 1 1 Ax - 



+ H\\x\ 



with Tikhonov regularization, which can 


be solved 


usinj 




E 




" b ' 


A = 


A*Ii 


, b = 







rC 








in the weighted formulation (3.2). 
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4.2. Underdetermined systems. Now let A e C JMXiv be HBS with M < N 



Then (1.4) can be solved using the same algorithm as above but with 

A = 



II 


, b = 





tM 







4.3. Error analysis. Let k(A) = \\A\\ \\A + \\ be the condition number of A. Then 



provided that en(A) <C 1, the exact solution of the least squares problem (|2.6[) or (2.7) 



governed by the compressed matrix A e in (2.3 1 is accurate to precision 0(en(A)) with 



respect to the true solution (1.2) (see |38] ) . Furthermore . standard estimates show 
that k(A e ) = 0(k(A)). Therefore, if A is not too ill-conditioned, then neither is A e , 
so the deferred correction procedure will succeed. Thus, we may expect a relative 
error of the order 0(ck(A)). 

5. Complexity analysis. In this section, we analyze the complexity of our 
solver for a representative example: the HBS matrix A defined by a kernel from 
Table [i~T| acting on source and target data distributed uniformly over the same d- 
dimensional domain (but at different densities). We follow the approach of [55] , 

Sort both sets of data together in one hyperoctree (the multidimensional gener- 
alization of an octree, cf. [33]) as outlined in section [2j subdividing each node until 
it contains no more than a set number of combined row and column indices, i.e., 

+ \ Ji\ = 0(1) f° r each leaf i. For each tree level I = 0, . . . , A, let pi denote the 
number of matrix blocks; mi and ni, the row and column block sizes, respectively, 



in the compressed representation (2.3), assumed equal across all blocks for simplic- 
ity; and ki, the skeleton block size, which is clearly of the same order for both rows 
and columns as it depends only on min{C(rn ; ), C(n ; )} (this can be made obvious 
by explicitly considering proxy compression, which produces interaction matrices of 
dimension 0(mi) x 0(ni)). Note that m/ and ni are not the row and column block 
sizes in the tree; they are the result of hierarchically "pulling up" skeletons during 
the compression process (see (iv) below). Moreover, since M ^ N in general, we can 
define an additional level parameter A' < A corresponding to the depth of the tree 
constructed via the same process on only the smaller of the source or target data, e.g., 
on only the source data if M > N. For the remainder of this discussion, we assume 
that M > N. Analogous results can be recovered for M < N simply by switching the 
roles of M and N in what follows. We start with some useful observations: 

(i) By construction, p\{m\ + n\) = M + N ~ M, where m\,n\ = 0(1), so 
p\ ~ M. Similarly, p\> ~ N. 

(ii) Each subdivision increases the number of blocks by a factor of roughly 2 d , 
so pi + i ~ 2 d pi. In particular, p — 1, so A ~ logM and A' ~ (1/d) log TV. 

(hi) For I = X' + 1, . . . , A, ki = 0(1) since min{C(m i ), 0(«/)} = 0(1), while for 
I = 0, .... A', it can be shown [33 that 



(5.1) 



I ifd = l, 

2(<*~ 1 )' ifd>i. 



(iv) The total number of row and column indices at level I < A is equal to the 
total number of skeletons at level I + 1, i.e., pimi = Pini = Pi+\ki + \, so nii,ni ~ 

5.1. Matrix compression. From [171 S3 ISO] , the cost of computing a rank-fc 
ID of an m x n matrix is 0(kmn). If proxy compression is used, then m ~ mi for a 
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Q 


R 






k. 






■ ■ 




■ k 




IBk. 


■ ■■ 




■ k. 




IBk. 


■ ■ 




■ k. 




^Bk. 


■ ■■ 




■ k 




^Bk. 


■ ■ 




■ k. 




^B 






■ k. 







Fig. 5.1. Sparsity structure of QR factors for the structured embedding \2.4\ with M = N, 
where the orthogonal matrix Q is given in terms of the elementary Householder reflectors m . 



block at level I. Therefore, the total cost of matrix compression is 

A A 



mini 

1=0 1=0 

We break this into two sums, one over I = X' + 1, . . . , A and another over I = 0, . . . , A', 
with estimates 



A' 



J2 Pik?~M, 5>zfc? 



Z=A' + 1 

respectively. Hence, 
(5.2) 



N 



if d= 1, 



i s N s[i-i/d) if d>1> 



T 



M + N ifrf=l, 
M + tfHi-i/d) ifd>i. 



5.2. Compressed QR factorization. We consider the QR decomposition of 



a block tridiagonal matrix with the same sparsity structure as that of M in (2.4), 
computed using Householder reflections. This clearly encompasses the factorization 
of A = QR for both over- and underdetermined systems. We begin by studying the 
square M — N case, for which it is immediate that R is block upper bidiagonal (Fig. 
5.1), so the cost of Householder triangularization is 



(5.3) 



E 

1=0 



pimirit 



A 



^Vitf 

1=0 



M + N if d = 1, 

M + N Hi-i/d) if > i 



following the structure of R. The M > N and M < N cases are easily analyzed by 
noting that only the blocks at level A are rectangular, so any nonzero propagation 
during triangularization is limited and both essentially reduce to the square case above 
(Fig. 



5.2). 



The complexities (5.2) and (5.3) of compression and factorization, respectively, 



together constitute the cost of the precomputation phase. 

5.3. Compressed pseudoinverse application. We now examine the cost of 



applying the pseudoinverse to solve the weighted least squares problem (3.2) using the 
precomputed QR factors. For this, we suppose that the solution is determined via the 
equation Rx = Q*b, which requires one application of Q*, assumed to be performed 
using elementary Householder transformations, and one backsolve with R, whose cost 
is clearly on the same order as multiplying by R. Then from the arguments above, it 
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A 


R 








■ ■ 






■ ■■ 






■ ■ 






■ ■■ 






■ ■ 













M > N 



M <N 



Fig. 5.2. Sparsity structure of the triangular QR factor R. for the structured embedding (2. 
in the M > N (left) and M < N (right) cases. 



is evident that both operations have complexity 

a a ( M + N if d = 1, 

;=o i=o [ M + AT 2 ^- 1 ^) ifd>2. 

Since the total number of such problems to be solved is constant for each outer problem 



(2.6), cf. section 3.2 this is also the complexity of the solution phase. 

Remark. One can also use the seminormal equations R*Rx = A*b, which do 
not require the orthogonal matrix Q [5j. However, one step of iterative refinement is 
necessary for stability, so the total cost is three applications of A or A* (one each for 
the original and refinement solves, plus another to compute the residual), and four 
solves with R or R*. In practice, we found this approach to be slower than that 
involving Q by a factor of about four. 

5.4. Some remarks. For all complexities above, the constants implicit in the 
estimates are of the form 0(2 d log" e) for modest a, i.e., they are exponential in the 
dimension and polylogarithmic in the precision (281 129] . 

In the special case that the source and target data are separated, ki = 0(1) for all 
I, so T cm , T qr , and T sv all have optimal complexity 0(M + N) in any dimension. This 
describes, for example, the fitting of atomic partial charges to reproduce electrostatic 
potential values on "shells" around a molecule [6] [23] , the computation of equivalent 
densities in the kernel-independent FMM [53] , and even the calculation of the ID in 
recursive skeletonization 38J, which requires a least squares solve [T7] . 

6. Updating and downdating. We now discuss an important feature of our 
direct solver: its capacity for efficient updating and downdating in response to dy- 
namically changing data. Our methods are based on the augmented system approach 
of [57], extended to the least squares setting, and exploit the ability to rapidly apply 
A + via the solution phase of our algorithm, which we hereafter take as a compu- 



tational primitive. Thus, suppose that we are given some base linear system (1.1), 
focusing for simplicity on the overdetermined case, for which we have precomputed 
a compressed QR factorization of A. We consider the addition and deletion of both 
rows and columns, corresponding to the modification of observations and regression 
variables, respectively. Furthermore, we assume that such modifications are small, in 
particular so that the system remains overdetermined, and accommodate each case 
within the framework of the general equality-constrained least squares problem 

(6.1) min||Ex-f||. 

Cx=g 
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6.1. Adding and deleting rows. To add p r r ows to the matrix A — (a n), and 
correspondingly to the vector b = (&i, . . . , &m) t , in ( 1.1 ), wc simply use (6.1 1 with 



(6.2) 



E 



A 

CL 



0, 



x = x, 



g = 0, 



where CL € C PrXAr describes the influence of the variables x on the new data b+. To 
delete q r rows with indices k\ , . . . , k qt , we add q r degrees of freedom to those rows to 
be deleted in order to enforce strict agreement with those observations as follows: 



x 
x T 



f = b, g = d. 



E = [ A BL ] , C = [ CL I ] 
where BL = (5 kij ) e C Mxq ', CL = (a kij ) e C* xW , and d = {b kl , . . . ,b k J T ; here, 



1 if«=j, 
iii^j 

is the Kronecker delta. The simultaneous addition and deletion of rows can be 
achieved via a straightforward combination of the above: 

A B T _ 
CI 



E 



[ CL I ], x 





X 




' b 


X = 




, f = 






xL _ 







g 



6.2. Adding and deleting columns. To add p c columns to A and hence to 
the vector x = (xi, . . . , a;jv) T , we let 



E 



[A BL\ ] , C = 0, x 



x 



g = 0, 



where BL\ € C Mx Pc describes the influence of the new variables x c + on the data. To 
delete q c columns with indices 1%, . . . , lg a , we add q c "anti- variables" annihilating their 
effects: 



E 



[A BL ] , C = [ CL I], x 



x 
x c 



where BL = (a Uj ) £ C Mx "" and CL = (S U] ) £ 
columns simultaneously, we take 

E = [ A B c + BL ] , C = [ CL / ] , 



f = b, g = 0, 
q c ~xN _ finally to add and delete 



x 
x c 



f = b, g = 0. 



6.3. Simultaneous modification of rows and columns. The general case of 



modifying both rows and columns can thus be treated using (6.1) with 

C 



E 



.4 
CL 



B< 



BL BL 
L>_ 



CL / 
CL / 



and 



x 

X T _ 

x c 



f = 



" b 




' d ' 


. b + . 


. g = 






where D + E C PlXPc describes the influence of x c + on b + , D_ = (cn.) e C PrX<?c for 
CL = (cij) accounts for the effect of column deletion on b + (alternatively, one can zero 
out the relevant columns in CL), and all other quantities are as defined previously. 
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6.4. Solution methods. The augmented system (6.1 1 can be solved by deferred 



correction 5J, where the matrix to be considered at each step is 



A = 



E 




' A 


B 


rC 




C 


D 



for B £ C Mxn , C G C mxJV , and D G C mx ™, where m = p T + q T + q c and n = 
Pc + <Zr + Qc- If to and n are small, then an efficient approach is to compute A + 
by using A + in various block pseudoinverse formulas (see, e.g., |llj ) or by invoking 
Greville's method [12]|30], which can construct A + via a sequence of rank-one updates 
to A + . Alternatively, one can appeal to iterative methods like GMRES [IS], using 
A + as a preconditioner. In this approach, instead of solving 



min 1 1 Ax — b I 



f 



we consider instead, say, the left preconditioned system 

min||BAx - Bb|| 

X 

for an appropriate choice of the preconditioner B G C^ JV+ ™- )x ^ M+m - ) . Hayami, Yin, 
and Ito [37] showed that GMRES converges provided that range(A) = range(B*) and 
range(A*) = range(B). Therefore, suitable choices of B include, e.g., 



" A+ 


C* ' 




' A+ 


C* 


B* 


D* 




B* 


D+ 



(the latter if D is not rank-deficient), for which one application of A + is required per 
iteration. Such methods also have the possible advantage of being more flexible and 
robust. 

7. Numerical results. In this section, we report some numerical results for our 
fast direct solver, compared primarily against LAPACK/ ATLAS [U|49]. We consid- 
ered problems in both 2D and 3D. All matrices were block partitioned using quadtrees 
in 2D and octrees in 3D, uniformly subdivided until all leaf nodes contained no more 
than a fixed number of combined rows and columns (cf. sections 2.1 and [5]), while 
adaptively pruning all empty nodes during the refinement process. The recursive 
skeletonization algorithm was implemented in Fortran and employed as described in 
[55] , Sparse QR factorizations were computed with SuiteSparseQR [5D] through a 
Matlab R2012b (The MathWorks, Inc.: Natick, MA) interface, keeping all orthog- 
onal matrices in compact Householder form. The deferred correction procedure [5J 
was implemented in Matlab. All calculations were performed in double-precision real 
arithmetic on a single 3.10 GHz processor with 4 GB of RAM. In each case, we checked 
our results against those produced by either LAPACK/ ATLAS or a GMRES-based 
iterative solver. 

7.1. Laplace's equation. For benchmarking purposes, we first applied our 
method to Laplace's equation 

Au = in CI e R rf , u = f on 90 

in a simply connected interior domain with Dirichlet boundary conditions, which can 
be solved by writing the solution in the form of a double-layer potential 

dG 



u(x) 



CI I ^ — <r{y) dy for x G Q, 
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3D 
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=> 
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System size (AT) 
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□—a LP sv 0-^> FMM 



RS pc 
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Fig. 7.1. CPU times for solving Laplace's equation in 2D and 3D using LAPACK/ATLAS 
(LP), FMM/GMRES (FMM), and recursive skeletonization (RS) as a function of the system size. 
For LP and RS, the computation is split into two parts: precomputation (pc), for LP consisting of 
matrix formation and factorization, and for RS of matrix compression and factorization; and system 
solution (sv), consisting of matrix (pseudo-) inverse application via precomputed QR factors. The 
precision of FMM and RS was set at e = 10 — 9 in 2D and 10 — 6 in 3D. Dotted lines indicate 
extrapolated data; for RS in 3D, only factorization and inversion ( as executed through Matlab ) are 
extrapolated. 



where 

G(r) 



-l/(27r)logr if d = 2, 
l/(47rr) if d = 3 



is the free-space Green's function, ug is the unit outer normal at y £ d£l, and a is 
an unknown surface density. Letting x approach the boundary, standard results from 
potential theory [32] yield the second-kind Fredholm boundary integral equation 

(7-1) ir-(\\Z-y\\) °{y)dy = m 



2 J do. dvy 

for cr, assuming that dtt is smooth. This is not a least squares problem, but it allows 
us to compare the performance of the sparse QR approach with our previous sparse 



LU results [38]. Of course, since the system (7.1| is square, the corresponding sparse 



embedding (2.4) can be solved without iteration. 

In 2D, we took as the problem geometry a 2:1 ellipse, discretized via the trape- 
zoidal rule, while in 3D, we used the unit sphere, discretized as a collection of flat 
triangles with piecewise constant densities. We also compared our algorithm against 
an FMM-accelerated GMRES solver driven by the open-source FMMLIB software 
package [25) . which is a fairly efficient implementation (but not optimized using the 
plane wave representations of [29 ). Timing results for each case are shown in Fig. 
|7.1[ with detailed data for the recursive skeletonization scheme given in Tables |7.1| 
and 7.2 The precision was set to e = 10~ 9 in 2D and 1CP 6 in 3D. 



It is evident that our method scales as predicted, with precomputation and solu- 
tion complexities of O(N) in 2D (d = 1), and 0{N 3 ^ 2 ) and O(JVlogiV), respectively, 
in 3D (d = 2). In 2D, both phases are very fast, easily beating the uncompressed LA- 
PACK/ATLAS solver in both time (0(N 3 ) and 0(N 2 ) for precomputation and solu- 
tion, respectively) and memory, and coming quite close to the FMM/GMRES solver 
as well. The same is essentially true in 3D over the range of problem sizes tested, 
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Table 7.1 

Numerical results for solving Laplace's equation in 2D at precision t = 10~ 9 : N, uncompressed 
matrix dimension; K r , final row skeleton dimension; K c , final column skeleton dimension; T cm , ma- 
trix compression time (s); T qr , sparse QR factorization time (s); T BV , (pseudo-) inverse application 
time (s); E, L 2 relative error. 



N 


K T 


K c 










E 




1024 


30 


30 


3.1b-2 


9.4e-2 


5.8e- 


3 


1.1b- 


9 


2048 


29 


30 


6.5e-2 


5.8e-2 


1.8b- 


2 


4.5e- 


9 


4096 


30 


30 


1.3E-1 


1.2e-1 


3.7e- 


2 


1.5b- 


8 


8192 


30 


31 


2.6e-1 


2.6e-1 


5.7e- 


2 


1.4b- 


8 


16384 


31 


31 


5.2e-1 


6.0e-1 


9.2e- 


2 


1.7b- 


8 


32768 


30 


30 


l.lE+0 


I.Oe+0 


2.0e- 


1 


1.2b- 


8 


65536 


30 


30 


2.1e+0 


2.0e+0 


3.4e- 


1 


1.6b- 


8 


131072 


29 


29 


4.1e+0 


4.7e+0 


6.8e- 


1 


2.2b- 


8 



Table 7.2 

Numerical results for solving Laplace's equation 3D at precision e = 10 — 6 . Extrapolated values 
are given in parentheses; all other notation as in Table \77l\ 



N 


K T 


K c 




1 q t 






E 




320 


320 


320 


2.3E-1 


5.1e-2 


4.5e- 


3 


1.5e- 


10 


720 


628 


669 


l.lE+0 


1.6E-1 


9.3e- 


3 


5.1e- 


07 


1280 


890 


913 


3.7e+0 


5.0e-1 


3.2e- 


2 


1.0b- 


06 


2880 


1393 


1400 


1.7E+1 


1.9e+0 


9.1b- 


2 


1.2b- 


06 


5120 


1886 


1850 


4.7e+1 


6.0e+0 


1.6b- 


1 


2.2e- 


06 


11520 


2750 


2754 


1.4e+2 


(1.9E+1) 


(3.9b- 


1) 






20480 


3592 


3551 


3.1e+2 


(4.6e+1) 


(7.4b- 


1) 







though it should be emphasized that FMM/GMRES has optimal O(N) complexity 
and so should prevail asymptotically. However, as observed previously [371 EH1 B2] > 
the solve time using recursive skeletonization following precomputation (comprising 
one application of Q* and one backsolve with R) is much faster than an individual 
FMM/GMRES solve: e.g., in 2D at N = 131072, T FMM = 3.9 s, while T sv = 0.7 
s. This is significantly slower when compared with our UMFPACK-based sparse LU 
solver [3S] (T sv ~ 0.1 s). The difference may be due, in part, both to a higher constant 
inherent in the QR approach and to the overhead from interfacing with Matlab. Un- 
fortunately, we were unable to perform the sparse QR factorizations in-core for the 
3D case beyond N ~ 10 4 ; the corresponding data are extrapolated via the results of 
section [5] 

7.2. Least squares fitting of thin plate splines. We next turned to an 
overdetermined problem involving 2D function interpolation using thin plate splines; 



see (1.3 1. More precisely, we sought to compute the coefficients aj of the interpolant 

N 

= ^2 a 3<hps(\\S-Sj\\), 

3 = 1 

that best matches a given function 

f(x, y) = sin(47ra;) + cos(27ry) sin(37rx?/) for x = (x, y) 

in the least squares sense on some set of randomly chosen targets a?{ G [0, l] 2 for 
i = l,..., M. The points Cj for j = 1, . . . , N denote the centers of the splines and lie 
on a uniform tensor product grid on [0, l] 2 . Since this is somewhat ill-conditioned, we 
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10 3 10 4 10 s 

System row dimension (M) 



o— o LP pc a— a LP sv • — • RS pc ■— ■ RS sv 



Fig. 7.2. CPU times for overdetermined thin plate spline fitting in 2D at precision e = 10 
using LAPACK/ ATLAS and recursive skeletonization as a function of the system row dimension 
M, with the column dimension fixed proportionally at N = M/A; all other notation as in Fig. \ 7. 1\ 

Table 7.3 

Numerical results for overdetermined thin plate spline fitting in 2D at precision e = 10~ 6 : 
M, uncompressed row dimension; N , uncompressed column dime nsion ; rii ter , number of deferred 
correction iterations; R, L 2 residual; all other notation as in Table \7\l] 



M 


N 


K T 


K c 




^ qr 


2 S V 


^itcr 


E 




R 




1024 


256 


174 


148 


7.7e-2 


1.7e-1 


5.3e-2 


l 


4.1e- 


5 


1.4e- 


l 


4096 


1024 


260 


247 


5.7e-1 


3.1e-1 


1.6E-1 


l 


8.3e- 


5 


4.4e- 


2 


16384 


4096 


399 


391 


3.1e+0 


1.5e+0 


I.Oe+0 


l 


3.9e- 


4 


1.6e- 


2 


65536 


16384 


564 


574 


1.5E+1 


7.0e+0 


4.7e+0 


l 


2.0e- 


6 


6.7e- 


3 



also add Tikhonov regularization with regularization parameter ft — 10 2 as indicated 



in (4.11 



Timing results for various M and N at a fixed ratio of M/N = 4 with e = 10 are 
shown in Fig. |7.2| with detailed data in Table [773] The results are consistent with our 
complexity estimates of 0(M + TV 3 / 2 ) and 0(M + N\ogN) for precomputation and 
solution, respectively. This compares favorably with the uncompressed complexities of 
0(MN 2 ) and 0(MN), respectively, for LAPACK/ ATLAS. Note also the convergence 
in the residual \\f(x) — g(x)\\ on the targets Xi of roughly second order. In all cases, 
the deferred correction procedure converged with just one iteration. 

7.3. Underdetermined charge fitting. We then considered an underdeter- 
mined problem: seeking a minimum- norm charge distribution in 2D. The setup is as 
follows. Let Xj for j — 1, . . . , N be uniformly spaced points on the unit circle, each 
associated with a random charge qj . We measure their induced field 

1 N 

f(x;q) = -— 2j^log||f-^|| 

i=i 

on a uniformly sampled outer ring of radius 1 + 5, and compute an equivalent set of 
charges qj with minimal Euclidean norm reproducing those measurements. Here, we 
set 6 = 10~ 4 and sampled at M = N/8 observation points over a range of N. 



Timing results at precision e = 10 are shown in Fig. 7.3 with detailed data given 



in Table 7.4 Since the source and target points are separated (by an annular region of 



width 6), our algorithm has optimal 0(M + N) complexity, which is readily observed. 
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System column dimension (N) 



o— o LP pc a— a LP sv • — • RS pc ■— ■ RS sv 



Fig. 7.3. CPU times for underdetermined charge fitting in 2D at precision e = 10 — 9 using 
LAPACK/ ATLAS and recursive skeletonization as a function of the system column dimension N, 
with the row dimension fixed proportionally at M = N/8; all other notation as in Fig. \ 7.1\ 

Table 7.4 

Numerical results for underdetermined charge fitting in 2D at precision e = 10 -9 ; notation as 
in Table\7H\ 



M 


N 


K x 


K c 




1 q r 




2 S V 




^itcr 


E 




R 




128 


1024 


69 


72 


1.7e-2 


2.1b- 


2 


1.6e- 


2 


l 


1.6e- 


9 


1.6e- 


15 


256 


2048 


80 


80 


4.4e-2 


4.3e- 


2 


5.8e- 


2 


2 


1.3b- 


-8 


1.1b- 


15 


512 


4096 


89 


90 


9.6e-2 


8.7e- 


2 


1.3b- 


I 


2 


6.1b- 


-8 


1.9e- 


15 


1024 


8192 


99 


100 


2.0e-1 


1.8e- 


I 


2.5e- 


I 


2 


5.5e- 


-8 


3.0e- 


15 


2048 


16384 


108 


110 


4.0e-1 


3.8e- 


I 


5.1e- 


I 


2 


3.6e- 


-8 


1.8b- 


15 


4096 


32768 


119 


119 


S.Ie-1 


8.2e- 


1 


1.2e4 


-0 


2 


3.5e- 


-8 


2.1e- 


15 


8192 


65536 


128 


131 


1.6e+0 


2.0e4 





2.6e4 


-0 


2 


4.8e- 


-9 


7.1b- 


09 


16384 


131072 


134 


138 


3.3e+0 


4.0e4 





5.5e4 


-0 


2 


6.6e- 


9 


7.5e- 


09 



Furthermore, as we have solved an approximate, compressed system, we cannot in 
general fit the data exactly. Indeed, we see residuals ||/(x; q) — f(x; q)\\ of order 0(e) 
as predicted by the compression tolerance. Thus, our algorithm is especially suitable in 
the event that observations need to be matched only to a specified precision. Deferred 
correction was successful in all cases within two iterations. 

7.4. Thin plate splines with updating. In our final example, we demonstrate 
the efficiency of our updating and downdating methods in the typical setting of fitting 
additional observations to an already specified overdetermined system. For this, we 
employed the thin plate spline approximation problem of section 7.2 with M — 16384 
and N = 4096, followed by the addition of 50 new, random target points. From 
section|6j the perturbed system can be written as (6.1) with (6.2 1, i.e., 



Ex ~ f where E — 



A 

CI 



We used GMRES with the left preconditioner B = [A + , (C+)*], which, since A 
has full column rank, gives BE = I + (C^)*C^, hence the preconditioned system is 



[j + (c;)*c; 



Bf. 



Note that only one application of A + is necessary, independent of the number of 
iterations required. Solving this in Matlab took 17 iterations and a total of 1.9 
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s, with 1.6 s going towards setup. The residual on the new data was 7.7 x 10 -3 . 
This should be compared with the roughly 6 s required to solve the problem without 
updating, treating it instead as a new system via our compressed algorithm (Table 



7.3). Although this difference is perhaps not very dramatic, it is worth emphasizing 
that the complexity here scales as 0(M + N log TV) with updating versus 
without, as the former needs only to apply A + while the latter needs also to compress 
and factor A. Therefore, the asymptotics for updating arc much improved. 

8. Generalizations and conclusions. In this paper, we have presented a fast 
direct algorithm for overdetermined and underdetermined least squares problems in- 
volving HBS matrices, and exhibited its efficiency and practical performance in a 
variety of situations including RBF interpolation and dynamic updating. In ID (in- 
cluding boundary problems in 2D and problems with separated data in all dimensions), 
the solver achieves optimal 0(M + N) complexity and is extremely fast, but it falters 
somewhat in higher dimensions, due primarily to the growing ranks of the compressed 



matrices as expressed by (5.1 1. Developments for addressing this growth are now un- 
derway for square linear systems [18] . and we expect these ideas to carry over to the 
present setting. Significantly, the term involving the larger matrix dimension is linear 
in all complexities (i.e., only O(M) instead of 0{MN 2 ) as for classical direct meth- 
ods), which makes our algorithm ideally suited for large, rectangular systems where 
both M and N increase with refinement. 

Although we have not explicitly considered least squares problems with HBS 
equality constraints (we have only done so implicitly through our treatment of un- 
derdetermined systems), it is evident that our methods generalize. However, our 
complexity estimates can depend on the structure of the system matrix. In particu- 
lar, if it is sparse, e.g., a diagonal weighting matrix, then our estimates are preserved. 
We can also, in principle, handle HBS least squares problems with HBS constraints 
simply by expanding out both matrices in sparse form. 

Finally, it is worth noting that fast direct solvers can be leveraged for other least 
squares techniques as well. This is straightforward for the normal equations, which 
are subject to well-known conditioning issues, and for the somewhat better behaved 
augmented system version [8] : 



I A 




r 




' b ' 


A* 




X 








This approach has the advantage of being immediately amenable to fast inversion 
techniques, but it symmetrizes the system. Thus, all complexity estimates involve 
M + N instead of M and N separately. In particular, the current generation of fast 
direct solvers would require, e.g., 0((M + jV) 3 ( 1 - 1 /<*)) instead of 0(M + iV 3(1 " 1/rf) ) 
work. With the development of a next generation of linear or nearly linear time 
solvers (THJ, [31] , this distinction may become less critical. Memory usage and high- 
performance computing hardware issues will also play important roles in determining 
which methods are most competitive. We expect these issues to become settled in the 
near future. 
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